Based on historical data, a company is trying to decide whether to focus on developing its Website or its mobile App in order to make the user experience more enjoyable and profitable.
To answer this question, we can apply a machine learning method called linear regression.
# Importing the library
import pandas as pd
# Reading the historical data
data = pd.read_csv("data.csv")
# finding out how many rows and columns are in the dataframe
data.shape
# Looking at the first rows of our dataframe
data.head()
As we can see from the company's data, each row represents a single client.
# Looking at the last lines of our dataframe
data.tail()
# Dropping the ID column since it is irrelevant for the analysis
data.drop(columns='ID', inplace=True)
# Getting an overview of our data
data.describe()
# Making sure there is no NaN
data.isnull().values.any()
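The check above only returns a single True/False for the whole dataframe. If it ever returns True, a per-column count helps locate the gaps. The following is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are made up for illustration and are not the company's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the client data (values are made up)
toy = pd.DataFrame({
    "Time on App": [12.1, np.nan, 11.3],
    "Time on Website": [37.2, 36.5, 38.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

The same two expressions can be applied to `data` directly.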
We will take a close look at the time spent on both the App and the Website, keeping in mind that a large amount of time spent on one platform hints at more business there, which would make it a wise choice for development and investment.
For the exploratory process we will use Plotly and Cufflinks which will allow us to create interactive plots.
The libraries can be installed by typing into your command line/terminal:
pip install plotly
pip install cufflinks
# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not needed for offline plots
import plotly.figure_factory as ff
import cufflinks as cf
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
print(__version__)
# Plotting "Time on App" against "Yearly Amount Spent"
data.iplot(kind='scatter',x='Time on App',y='Yearly Amount Spent',mode='markers',size=10,color="green")
# Plotting "Time on Website" against "Yearly Amount Spent"
data.iplot(kind='scatter',x='Time on Website',y='Yearly Amount Spent',mode='markers',size=10,color="green")
# Plotting a scatterplot matrix of all the columns
fig = ff.create_scatterplotmatrix(data, height=1000, width=900)
iplot(fig, filename='Scatterplot Matrix')
From the scatterplot matrix, we can see a clear correlation between "Yearly Amount Spent" and "Length of Membership".
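A visual impression like this can be backed up with numbers by computing pairwise correlations. The sketch below uses synthetic data (the random generator, sample size, and the relationship between the columns are all assumptions made for illustration), so the printed values say nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: spending is built to depend on membership length only
rng = np.random.default_rng(0)
n = 200
membership = rng.uniform(1, 6, n)
website = rng.uniform(35, 40, n)
spent = 60 * membership + rng.normal(0, 10, n)

toy = pd.DataFrame({
    "Length of Membership": membership,
    "Time on Website": website,
    "Yearly Amount Spent": spent,
})

# Pearson correlation of every column with the target, strongest first
corr_with_target = toy.corr()["Yearly Amount Spent"].sort_values(ascending=False)
print(corr_with_target)
```

On the real data, `data.corr()["Yearly Amount Spent"]` would give the corresponding ranking.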
Now we can move on to linear regression. First, we define our dependent variable Y, then split the data into training and testing sets to fit and evaluate the model.
# Y variable
Y = data['Yearly Amount Spent']
# X : The remaining features
X = data[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
# Splitting data into training set and testing set using sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
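Because the split is random, the metrics and coefficients will vary slightly from one run to the next. One common way to make a run repeatable is to pass a fixed `random_state`. This is a minimal sketch on hypothetical toy arrays (`X_toy`, `y_toy`, and the seed 42 are arbitrary choices, not part of the original analysis):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed select exactly the same rows
split_a = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[1], split_b[1]))
```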
# Training the model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# Testing the model
predictions = lm.predict(X_test)
# Plotting Test against Predicted
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_test,predictions, color="green")
plt.xlabel('Y Test')
plt.ylabel('Predictions')
The model can be evaluated using sklearn metrics, by calculating the following:
MAE : Mean Absolute Error
MSE : Mean Squared Error
RMSE : Root Mean Squared Error
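To make the three definitions concrete, they can be computed by hand with NumPy. The arrays below are small made-up values chosen so the arithmetic is easy to follow; they are not taken from the model:

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred          # [1, 0, -1, -2]

mae = np.mean(np.abs(errors))     # average of absolute errors
mse = np.mean(errors ** 2)        # average of squared errors
rmse = np.sqrt(mse)               # square root of MSE, in the units of the target

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

MSE penalizes large errors more heavily than MAE, and RMSE brings that penalty back to the same units as "Yearly Amount Spent".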
# Calculating Metrics
import numpy as np
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
We will take a look at the coefficients and see what we can learn from them.
# Calculating coefficients
coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
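Each coefficient can be read as the change in the predicted target when that feature increases by one unit while the other features stay fixed. The sketch below checks this on synthetic data (the two features, the true weights 4.0 and 0.5, and the noise level are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends strongly on feature 0, weakly on feature 1
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 4.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X_toy, y_toy)

# Increase feature 0 by one unit while holding feature 1 fixed
base = np.array([[5.0, 5.0]])
bumped = base + np.array([[1.0, 0.0]])
delta = model.predict(bumped)[0] - model.predict(base)[0]

# The change in the prediction equals the coefficient of feature 0
print(delta, model.coef_[0])
```

This reading compares features fairly only when they are measured on comparable scales, which is worth keeping in mind when comparing the App and Website coefficients.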
Comparing the App and the Website, the more straightforward choice is to invest in developing the App, which is the more reasonable way to bring the company more success. On the other hand, the Website could use some attention to make it more profitable, and lessons from the accumulated App experience can be used to build a strong user experience on the Website.